BSB Recall Operation


X(t+1) = S(α·A·X(t) + λ·X(t) + γ·X(0))


• X(t+1) and X(t) are N-dimensional real vectors;

• A is an NxN connection matrix;

• α is a scalar constant feedback factor;

• λ is an inhibition decay constant;

• γ is a nonzero constant if there is a need to maintain the input stimulation;

• X(0) is the input stimulation;

• S() is the "squash" function:

  S(y) = +1 if y ≥ 1;  y if -1 < y < 1;  -1 if y ≤ -1
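
The recall operation can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the talk: the function names, the default parameter values, and the stopping rule (stop when the state no longer changes) are all assumptions made for the example.

```python
import numpy as np

def squash(y):
    """Piecewise-linear squash: clip every component to the range [-1, 1]."""
    return np.clip(y, -1.0, 1.0)

def bsb_recall(A, x0, alpha=0.5, lam=1.0, gamma=0.0, max_iters=100):
    """Iterate X(t+1) = S(alpha*A*X(t) + lam*X(t) + gamma*X(0)).

    Returns the final state and the number of iterations performed.
    Stopping when the state stops changing is an assumption of this sketch.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    for t in range(1, max_iters + 1):
        x_next = squash(alpha * (A @ x) + lam * x + gamma * x0)
        if np.array_equal(x_next, x):
            return x_next, t
        x = x_next
    return x, max_iters
```

With an identity connection matrix and alpha = lam = 1, each step doubles the state until every component saturates at a corner of the "box".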
BSB Training Operation


In training, the NxN connection matrix A is modified as


ΔA = lr * (X - AX) ⊗ X
A = A + ΔA


• X is the normalized input training pattern;

• lr is the learning rate;

• ⊗ is the outer product of two vectors;
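
The training rule can be sketched as below. The zero initial matrix, the default learning rate, and the function name are assumptions for illustration; the 500-iteration default follows the training count reported in the results.

```python
import numpy as np

def bsb_train(patterns, lr=0.05, iterations=500):
    """Apply A = A + lr * (X - A X) (outer) X repeatedly over all patterns.

    patterns: array of shape (num_patterns, N).
    Starting from a zero matrix is an assumption of this sketch.
    """
    patterns = np.asarray(patterns, dtype=float)
    n = patterns.shape[1]
    A = np.zeros((n, n))
    for _ in range(iterations):
        for x in patterns:
            x = x / np.linalg.norm(x)            # normalized training pattern
            delta = lr * np.outer(x - A @ x, x)  # outer product of error and input
            A = A + delta
    return A
```

For a unit-length pattern the error X - AX shrinks by a factor (1 - lr) on every update, so after training each stored pattern is (nearly) a fixed point of A.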
Application


Pattern  recognition 


10x10  black  &  white  patterns 


Input/state  vector  structure 
Total of 20 line patterns -> 20 tag entries


Application 1: Line Patterns
Application 2: Alphabetic Patterns


Application  3:  Symbol  Patterns 


Results 


Training: 500 iterations
Recall:

Input pattern                         Iterations for        Iterations for
                                      successful recalls    unsuccessful recalls
Exact training pattern without tag    8 ~ 11                N/A
Similar/partial pattern without tag   11 ~ 41               16 ~ 40
Generic Massively Parallel Cognitive Computing System


Multi-Layer Heterogeneous Interconnect Network



PE:  Processing  Element 
LS:  Local  Store  (local  memory) 
GM:  Global  Memory 
Overall Processing Element Architecture

FPGA Data Path of BSB 128 Recall

[Figure: data path diagram; surviving labels: index, cntr]
FPGA Resource Utilization


FPGA device: Xilinx Virtex-II Pro XC2VP70
Target clock frequency: 100 MHz

Resources            Recall Design    Training Design
Block RAMs           129 (of 288)     129 (of 288)
18-bit Multipliers   130 (of 328)     130 (of 328)
LUTs                 6127 (8%)        22716 (33%)
FPGA Results and Comparison


Software: 2.0 GHz Intel Xeon processor, 2 GB RAM
Programmed and compiled with Intel Math Kernel Library 9.0

Hardware: 90 MHz clock frequency

Implementation   Time per Recall* (µs)   Equivalent FLOPS
Software         12.5                    ~2.6G
Hardware         1.69                    ~19.4G

* Average of 1,000,000 recalls

Speed-up by hardware: 7.4X
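
The "Equivalent FLOPS" figures can be reproduced with a short calculation. This sketch assumes that the time per recall is in microseconds and that one recall is counted as a single 128x128 matrix-vector product (one multiply and one add per matrix element); the slides do not state the counting rule, so treat `FLOPS_PER_RECALL` as an assumption.

```python
N = 128
FLOPS_PER_RECALL = 2 * N * N  # assumed: one multiply + one add per matrix element

def equivalent_gflops(time_us):
    """Equivalent GFLOPS for a recall that takes time_us microseconds."""
    return FLOPS_PER_RECALL / (time_us * 1e-6) / 1e9

software_gflops = equivalent_gflops(12.5)   # Xeon + MKL
hardware_gflops = equivalent_gflops(1.69)   # FPGA at 90 MHz
speedup = 12.5 / 1.69

print(f"software: {software_gflops:.1f} GFLOPS")
print(f"hardware: {hardware_gflops:.1f} GFLOPS")
print(f"speed-up: {speedup:.1f}X")
```

Under these assumptions the three printed values round to the figures reported in the table.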
Software Solution on PS3


PlayStation  3 


■ Yellow Dog Linux 5.0

■  IBM  CELL  SDK  2.0 

■  Cell  processor 

■  256  MB  RAM 

■  60  GB  hard  drive 

■  Gigabit  Ethernet 

■ 150 GFLOPS single-precision peak


$499 
Cell Broadband Engine Processor


BEI: Cell Broadband Engine Interface
MIC: Memory Interface Controller
EIB: Element Interconnect Bus
PPE: PowerPC Processor Element
FlexIO: Rambus FlexIO Bus
RAM: Resource Allocation Management
IOIF: I/O Interface
SPE: Synergistic Processor Element
XIO: Rambus XDR I/O (XIO) cell
Storage and I/O Interface


Channel Interface (channel commands)
Local Storage (local-address space)
Main Storage (effective-address space)

DMA: Direct Memory Access
PPE: PowerPC Processor Element
EIB: Element Interconnect Bus
PPSS: PowerPC Processor Storage Subsystem
LS: Local Storage
PPU: PowerPC Processor Unit
MFC: Memory Flow Controller
SPE: Synergistic Processor Element
MMIO: Memory-Mapped I/O
SPU: Synergistic Processor Unit

Key  Performance  Numbers 


• Clock Frequency

  - 3.2 GHz

• Peak Performance

  - 25.6 GFLOPS per SPE

• Main Memory

  - 256 MB Rambus XDR in PS3 and blades

  - 25.6 GB/s total memory bandwidth
Computation Flow


10 Repetitions

Start -> DMA -> M_V -> SUM -> MA -> SQ -> Done

(a) Without double buffering


DMA:  Fetch  coefficient  matrix  (128x128)  from  main  memory  to  local  store 

M_V:  Multiplication  of  coefficient  matrix  and  state  vector 

SUM:  Summation  of  partial  products 

MA:  Multiply  by  Alpha  and  add  state  vector 

SQ:  Squash  function 
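
The five steps can be mimicked in NumPy to show how they fit together. This is an illustrative sketch only: the 4-wide `lanes` parameter imitating SIMD partial sums, the array copy standing in for the DMA transfer, and the function names are assumptions, and the real SPE code is not shown in the slides.

```python
import numpy as np

def recall_step(A, x, alpha, lanes=4):
    """One M_V -> SUM -> MA -> SQ pass over an n x n matrix (n divisible by lanes)."""
    n = A.shape[0]
    # M_V: element-wise products, accumulated separately in each of `lanes` slots
    products = A * x                                   # shape (n, n)
    partial = products.reshape(n, n // lanes, lanes).sum(axis=1)  # shape (n, lanes)
    # SUM: add the per-lane partial products of each row
    ax = partial.sum(axis=1)
    # MA: multiply by alpha and add the state vector
    y = alpha * ax + x
    # SQ: squash to the range [-1, 1]
    return np.clip(y, -1.0, 1.0)

def recall(A_main, x0, alpha, repetitions=10):
    """Fetch the matrix once, then run the fixed number of repetitions."""
    A_local = np.array(A_main, dtype=float, copy=True)  # DMA stand-in: copy to "local store"
    x = np.asarray(x0, dtype=float)
    for _ in range(repetitions):
        x = recall_step(A_local, x, alpha)
    return x
```

The result matches a direct evaluation of clip(alpha * A @ x + x, -1, 1) repeated ten times; the split into partial products and a separate SUM only changes the order of the additions.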


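Performance Results


100,000 iterations on each SPE

Each row is one algorithm configuration, defined by which of the steps M_V, SUM, MA, SQ, and DMA are enabled. [The per-row check marks are not legible in this copy.]

Configuration   Runtime (s)   Performance (GFLOPS / SPE)
1               3.22          10.3
2               2.32          14.3
3               2.20          15.0
4               2.14          15.3
5               1.79          18.1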
Further Optimization


• Memory I/O

  - Only reached 12.2 GB/s in previous results

  - Solution: Reconfigured and recompiled the Linux kernel to support a large memory page size

  - Result: 24.4 GB/s memory I/O speed; 14.1 GFLOPS per SPE

• SPE programming

  - Operations in the "SUM" step are not vectorizable.

  - Solution: New algorithm that eliminates the "SUM" step.

  - Goal: 18 GFLOPS per SPE
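
The slides do not describe the new algorithm. One common way to avoid a horizontal summation in a matrix-vector product is to accumulate scaled columns instead of reducing row dot products, so every operation is an element-wise multiply-add. The sketch below illustrates that general idea under this assumption; it is not claimed to be the authors' method, and the function name is hypothetical.

```python
import numpy as np

def matvec_by_columns(A, x):
    """Compute A @ x as a sum of scaled columns: y += A[:, j] * x[j].

    Each update is element-wise across the output vector, so no per-row
    horizontal sum of partial products is needed at the end.
    """
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.zeros(A.shape[0])
    for j in range(A.shape[1]):
        y += A[:, j] * x[j]   # multiply-add over the whole output vector
    return y
```

On a SIMD unit this layout would map each group of output elements to one vector register, at the cost of storing the matrix column-major (or transposed) in local store.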
Conclusion


• Brain-State-in-a-Box (BSB) cognitive models have been optimized for both FPGA and Cell implementations

• >19 GOPS demonstrated on 6M gate Virtex 2 FPGA on Heterogeneous HPC at Rome

• >14 GFLOPS (55% of peak) demonstrated so far on each SPE of Cell BE

  - 85 GFLOPS on the 6 SPEs of the Cell in a PS3

  - 170 TFLOPS/$M price performance at $499 per PS3

• Expect to demonstrate 18 GFLOPS (71% of peak)

• Compute/IO ratio of ~20 adequate to balance Cell within PS3 for high performance

  - 23.4 ops/IO balances 150 GFLOPS to 25.6 GB/sec RDRAM

• FPGAs hard pressed to match the price or performance of the Cell
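
The headline numbers in the conclusion follow from figures given earlier in the slides. The sketch below reproduces them; the 4-byte word size (single precision) and the reading of the ~20 compute/IO ratio as 10 repetitions times 2 operations per matrix element are assumptions, not statements from the talk.

```python
PEAK_GFLOPS_PER_SPE = 25.6
MEASURED_GFLOPS_PER_SPE = 14.1
SPES_IN_PS3 = 6
PS3_PRICE_USD = 499
PS3_PEAK_GFLOPS = 150.0
MEMORY_BANDWIDTH_GB_S = 25.6
BYTES_PER_WORD = 4          # assumed: single-precision floats
REPETITIONS = 10            # from the computation flow
OPS_PER_ELEMENT = 2         # assumed: one multiply + one add per matrix element

total_gflops = MEASURED_GFLOPS_PER_SPE * SPES_IN_PS3
fraction_of_peak = MEASURED_GFLOPS_PER_SPE / PEAK_GFLOPS_PER_SPE
# GFLOPS per dollar -> GFLOPS per million dollars -> TFLOPS per million dollars
tflops_per_million_usd = total_gflops / PS3_PRICE_USD * 1e6 / 1e3
# Operations per word of memory traffic needed to keep the peak rate fed
balance_ops_per_word = PS3_PEAK_GFLOPS / (MEMORY_BANDWIDTH_GB_S / BYTES_PER_WORD)
# Operations performed per matrix word fetched, if the matrix is reused for all repetitions
bsb_ops_per_word = REPETITIONS * OPS_PER_ELEMENT

print(round(total_gflops), round(fraction_of_peak * 100),
      round(tflops_per_million_usd), round(balance_ops_per_word, 1), bsb_ops_per_word)
```

Under these assumptions the BSB kernel performs about 20 operations per word fetched, close to the 23.4 operations per word at which peak compute and memory bandwidth are balanced.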